A short description of the post.
packages = c('DT','ggiraph','plotly','tidyverse', 'raster','sf','clock','tmap',
'rgdal','dplyr', 'tidyr', 'textclean', "plotly", "forcats", "jpeg", "tiff",
"mapview","tidygraph","igraph","ggraph","visNetwork","leaflet","lubridate")
for(p in packages){
if(!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
car <- read_csv("data/mc2/car-assignments.csv")
cc <- read_csv("data/mc2/cc_data.csv", locale = locale(encoding = "ASCII"))
gps <- read_csv("data/mc2/gps.csv")
loyalty <- read_csv("data/mc2/loyalty_data.csv",locale = locale(encoding = "ASCII"))
First, we will customize the all code chunks using the below knitr code. More information on chunk options can be found here
Next, we will install the required R packages. There are three basic groups of packages that we will install,
For data manipulation and preparation The tidyverse package is a group of R packages including dplyr, tidyr that assist user to manipulate data.
Date and Time The two packages clock and lubridate are used for the manipulation of date and time data
Interactive data analysis The two packages ggiraph and plotly are used to output data into interactive graphical/chart form for analysis.
GeoVisual Analysis The packages raster, sf,tmap and rgdal are used for geospatial visual analytics where data are output to a map for analysis such as movement of people etc.
Network Analysis Lastly, to analyse the relationship between people etc. We will use the packages from tidygraph, igraph, ggraph and visNetwork. visNetwork is a package to output interactive network analysis.
packages = c('DT','ggiraph','plotly','tidyverse', 'raster','sf','clock','tmap',
'rgdal','dplyr', 'tidyr', 'textclean', "plotly", "forcats", "jpeg", "tiff",
"mapview","tidygraph","igraph","ggraph","visNetwork","leaflet","lubridate")
for(p in packages){
if(!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
There are 3 different types of data & information provided. A geospatial dataset, csv files and a jpg file consisting of the Abila tourist map.
Below is the information of all the data provided.
Employee Last Name
Employee First Name
Car ID (integer)
Current Employment Type (Department; categorical)
Current Employment Title (job title; categorical)
3.1 Timestamp
3.2 Car ID (integer)
3.3 Latitude
3.4 Longitude
4.1 Timestamp
4.2 Location (name of the business)
4.3 Price (real)
4.4 Loyalty Number (A 5-character code starting with L that is unique for each card)
5.1 Timestamp
5.2 Location (name of the business)
5.3 Price (real)
5.4 Last 4 digits of the credit or debit card number
We will import the 4 different csv datasets that were provided
car <- read_csv("data/mc2/car-assignments.csv")
cc <- read_csv("data/mc2/cc_data.csv")
gps <- read_csv("data/mc2/gps.csv")
loyalty <- read_csv("data/mc2/loyalty_data.csv")
If we take a look at the above datasets in excel, we will see that there are foreign characters in some of the datasets provided. An example is the Katerina’s Cafe as shown below. To address this, we will need to encode the dataset to allow rstudio to read properly.

To allow use to know the encoding type for both cc and loyalty dataset. Guess encoding will be used to detect the encoding type as shown below..
guess_encoding(cc)
# A tibble: 1 x 2
encoding confidence
<chr> <dbl>
1 ASCII 1
guess_encoding(loyalty)
# A tibble: 1 x 2
encoding confidence
<chr> <dbl>
1 ASCII 1
Next, we will add the code locale = locale(encoding = “ASCII” on both cc and loyalty dataset)
car <- read_csv("data/mc2/car-assignments.csv")
cc <- read_csv("data/mc2/cc_data.csv", locale = locale(encoding = "ASCII"))
gps <- read_csv("data/mc2/gps.csv")
loyalty <- read_csv("data/mc2/loyalty_data.csv",locale = locale(encoding = "ASCII"))
First, we will look at both cc and loyalty card dataset by using the glimpse function as shown below. There are 1490 rows and 1392 rows in both the cc and loyatly dataset respectfully. If we take look closely, we will see that these two datasets are closely linked by location, price and timestamp except the last4ccnum and loyaltynum are different.
If we take a look back at the MC2 background, we will observe that Kronos based companies are allowed to collect credit card and loyalty cards information on GAStech employees purchases as such these two datasets are similar in nature.
However, the rows for both cc and loyalty card data are different. This anomaly might have a few reasoning, 1) the employees did not used their credit cards while doing purchases but loyalty card was presented. 2) Vice versa, employees might also used their credit card but did not present their loyalty card during purchases.
glimpse(cc)
Rows: 1,490
Columns: 4
$ timestamp <chr> "1/6/2014 7:28", "1/6/2014 7:34", "1/6/2014 7:35"~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(loyalty)
Rows: 1,392
Columns: 4
$ timestamp <chr> "1/6/2014", "1/6/2014", "1/6/2014", "1/6/2014", "~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
Next, we will look at the vehicle datasets. The 44 rows car datasets are represented by the employee, employment information with their car assignment ID.
The gps datasets are based on the car movements in respect to their lat and long position with timestamp.
glimpse(car)
Rows: 44
Columns: 5
$ LastName <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(gps)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "1/6/2014 6:28", "1/6/2014 6:28", "1/6/2014 6:28",~
$ id <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
We will prepare the cc and loyalty datasets for data exploration later.
The last4ccnum of the cc datasets and the loyaltynum of the loyalty dataset should be a categorical data type. As such, we will change it by using the as.factor function.
Next, we will modify the datatype for both the timestamp of cc and loyalty dataset using the clock package. If we observe below, the data_time_parse function is use to change the timestamp to dttm (datetime) format while the date_parse function is used to change the data to date format.
cc$timestamp <- date_time_parse(cc$timestamp,
zone = "",
format = "%m/%d/%Y %H:%M")
loyalty$timestamp <- date_parse(loyalty$timestamp,
format = "%m/%d/%Y")
We will double check the dataset to confirm that the datatype has been changed to the one we wanted.
glimpse(cc)
Rows: 1,490
Columns: 4
$ timestamp <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(loyalty)
Rows: 1,392
Columns: 4
$ timestamp <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <fct> L2247, L9406, L8328, L6417, L1107, L4034, L6110, ~
To allow us to dive deeper into our analysis later, we will split the timestamp dataset into date, time and day columns. We will first add more columns by using the mutate function from dplyr to add day, date and time columns in the cc dataset and importing in to the cc_dtsplit object.
There is no need to add columns for the loyatly dataset as we will join both the datasets together in which day will be included in the joined dataset.
cc_dtsplit <- cc %>%
mutate(day = date_weekday_factor(cc$timestamp), date = as_date(cc$timestamp), time = format(cc$timestamp, format = "%H:%M"))
cc_dtsplit
# A tibble: 1,490 x 7
timestamp location price last4ccnum day date
<dttm> <chr> <dbl> <fct> <ord> <date>
1 2014-01-06 07:28:00 Brew've Been~ 11.3 4795 Mon 2014-01-06
2 2014-01-06 07:34:00 Hallowed Gro~ 52.2 7108 Mon 2014-01-06
3 2014-01-06 07:35:00 Brew've Been~ 8.33 6816 Mon 2014-01-06
4 2014-01-06 07:36:00 Hallowed Gro~ 16.7 9617 Mon 2014-01-06
5 2014-01-06 07:37:00 Brew've Been~ 4.24 7384 Mon 2014-01-06
6 2014-01-06 07:38:00 Brew've Been~ 4.17 5368 Mon 2014-01-06
7 2014-01-06 07:42:00 Coffee Camel~ 28.7 7253 Mon 2014-01-06
8 2014-01-06 07:43:00 Brew've Been~ 9.6 4948 Mon 2014-01-06
9 2014-01-06 07:43:00 Brew've Been~ 16.9 9683 Mon 2014-01-06
10 2014-01-06 07:47:00 Hallowed Gro~ 16.5 8129 Mon 2014-01-06
# ... with 1,480 more rows, and 1 more variable: time <chr>
To allow us to join both datasets together, we will need to rename the timestamp column from the loyatly dataset to date so that both the date columns have the same name.
loyalty_dt <- rename(loyalty, date = timestamp)
We will take a look at our newly cleaned datasets to double check the changed we have made previously.
glimpse(cc_dtsplit)
Rows: 1,490
Columns: 7
$ timestamp <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ day <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
$ date <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ time <chr> "07:28", "07:34", "07:35", "07:36", "07:37", "07:~
glimpse(loyalty_dt)
Rows: 1,392
Columns: 4
$ date <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <fct> L2247, L9406, L8328, L6417, L1107, L4034, L6110, ~
First, we will take a look at the car & gps datasets. Notice that the CarID and id for both datasets are not of the correct categorical datatype. We will proceed to change both the two columns.
glimpse(car)
Rows: 44
Columns: 5
$ LastName <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(gps)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "1/6/2014 6:28", "1/6/2014 6:28", "1/6/2014 6:28",~
$ id <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
Changing of CarID and ID to categorical data.
Next, we will combine both first and last name of the car datasets into one column for us to analyse the person easily. We will use the tidyr function unite to unify both the first and last name into name.
car_unite <- car %>%
unite(col = "name", LastName,FirstName, sep = ", ", remove =FALSE)
Next, we will rename the id of gps to CarID to match with the car_unite data. Additionally, the Timestamp data of gps will be changed to the dttm format
gps_cleaned <- rename(gps,CarID = id)
gps_cleaned$Timestamp <- date_time_parse(gps_cleaned$Timestamp,
zone = "",
format = "%m/%d/%Y %H:%M")
Lastly, we will look at our cleaned dataset and we have done cleaning the data.
glimpse(car_unite)
Rows: 44
Columns: 6
$ name <chr> "Calixto, Nils", "Azada, Lars", "Bala~
$ LastName <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(gps_cleaned)
Rows: 685,169
Columns: 4
$ Timestamp <dttm> 2014-01-06 06:28:00, 2014-01-06 06:28:00, 2014-01~
$ CarID <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
As shown previosuly in our data preparation segment, there is a difference in rows (difference of 98 rows) between the cc and loyalty card data. This shows an anomaly that one of the cards were used but not both during purchases which might results in a difference in number of entries.
glimpse(cc_dtsplit)
Rows: 1,490
Columns: 7
$ timestamp <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ day <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
$ date <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ time <chr> "07:28", "07:34", "07:35", "07:36", "07:37", "07:~
glimpse(loyalty_dt)
Rows: 1,392
Columns: 4
$ date <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <fct> L2247, L9406, L8328, L6417, L1107, L4034, L6110, ~
As explained previously, the cc and loyalty dataset are similar in nature with just an exception of the last4ccnum and loyaltynum, therefore, a join is required to match rows where the users uses both cc and loyalty card data. However, before a join is performed, we will analyse both the cc and loyalty dataset based on location counts.
A first exploratory data analysis by using a bar chart is performed on the cc_dtsplit dataset to find out the location which were patronize the most.
Based on the bar chart below, we are able to observe that there are 4 locations that are siginificantly higher in visitors than the rest of the locations. The highest is Katerina’s cafe with a total of 212 visits followed by Hippokampos, Guys’Gyros and Brew’s been served. These four places excluding Hippokampos which is not determinable in the map provided are all restaurants/bars.
cc_dtsplit_bar <- cc_dtsplit %>%
count(location) %>%
mutate(location = fct_reorder(location, n, .desc =TRUE)) %>%
plot_ly(x = ~location, y = ~n, marker = list(color = ~n)) %>%
add_bars() %>%
layout(title = "Total number of visitation by location", xaxis = list(title = ""),yaxis = list(title = "Number of visitors"))
cc_dtsplit_bar
Next, we will take a look at the loyalty_dt dataset.
Based on the bar chart below, we are able to observe that the 4 locations that were visited mostly are the same four locations that was shown on the cc_dtsplit dataset.
One difference is that Ouzeri Elian jumps to the fifth spot which differs from the cc_dtsplit dataset where Ouzeri Elian is in the sixth spot.
loyalty_dt_bar <- loyalty_dt %>%
count(location) %>%
mutate(location = fct_reorder(location, n, .desc =TRUE)) %>%
plot_ly(x = ~location, y = ~n, marker = list(color = ~n)) %>%
add_bars() %>%
layout(title = "Total number of visitation by ", xaxis = list(title = ""),yaxis = list(title = "Number of visitors"))
loyalty_dt_bar
Next, an inner join will be conducted by joining date, location and price and will will output into a new object cc_loyalty_join.
Notice that there are only 1,087 rows wheareas both cc_dtsplit and loyalty_dt has 1490 and 1392 rows respectfully. These shows that some of the data in cc does not match the loyalty data which strengthens the initial anomaly that some of the employees uses one of each card but not both during purchases.
cc_loyalty_join <- cc_dtsplit %>%
inner_join(loyalty_dt, by = c("date","location", "price"))
cc_loyalty_join
# A tibble: 1,087 x 8
timestamp location price last4ccnum day date
<dttm> <chr> <dbl> <fct> <ord> <date>
1 2014-01-06 07:28:00 Brew've Been~ 11.3 4795 Mon 2014-01-06
2 2014-01-06 07:35:00 Brew've Been~ 8.33 6816 Mon 2014-01-06
3 2014-01-06 07:36:00 Hallowed Gro~ 16.7 9617 Mon 2014-01-06
4 2014-01-06 07:37:00 Brew've Been~ 4.24 7384 Mon 2014-01-06
5 2014-01-06 07:38:00 Brew've Been~ 4.17 5368 Mon 2014-01-06
6 2014-01-06 07:43:00 Brew've Been~ 9.6 4948 Mon 2014-01-06
7 2014-01-06 07:43:00 Brew've Been~ 16.9 9683 Mon 2014-01-06
8 2014-01-06 07:47:00 Hallowed Gro~ 16.5 8129 Mon 2014-01-06
9 2014-01-06 07:48:00 Hallowed Gro~ 10.7 3492 Mon 2014-01-06
10 2014-01-06 07:49:00 Coffee Camel~ 8.39 5921 Mon 2014-01-06
# ... with 1,077 more rows, and 2 more variables: time <chr>,
# loyaltynum <fct>
We will examine the two vehicle related data. Gps and car-assignments data.
The car-assignment data with a total of 44 rows consists of the name and appointment of the employee tag to a CarID.
The GPS data with a total of 685169 rows consists of the id of the car and its movement based on latitude and longitude with timestamp.
glimpse(car_unite)
Rows: 44
Columns: 6
$ name <chr> "Calixto, Nils", "Azada, Lars", "Bala~
$ LastName <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(gps_cleaned)
Rows: 685,169
Columns: 4
$ Timestamp <dttm> 2014-01-06 06:28:00, 2014-01-06 06:28:00, 2014-01~
$ CarID <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
In order for us to match the car assignment person to the gps data. We will inner join the data.
Notice that there is only a total of 613077 dataset. This shows that some of the vehicles recorded are not part of the car_assignment data. I.e. vehicles apart from the car assignees has been tracked too. This phenomena wil be explained in the next section.
gps_car <- gps_cleaned %>%
inner_join(car_unite, by = "CarID")
gps_car
# A tibble: 613,077 x 9
Timestamp CarID lat long name LastName FirstName
<dttm> <fct> <dbl> <dbl> <chr> <chr> <chr>
1 2014-01-06 06:28:00 35 36.1 24.9 Vasco-Pai~ Vasco-P~ Willem
2 2014-01-06 06:28:00 35 36.1 24.9 Vasco-Pai~ Vasco-P~ Willem
3 2014-01-06 06:28:00 35 36.1 24.9 Vasco-Pai~ Vasco-P~ Willem
4 2014-01-06 06:28:00 35 36.1 24.9 Vasco-Pai~ Vasco-P~ Willem
5 2014-01-06 06:28:00 35 36.1 24.9 Vasco-Pai~ Vasco-P~ Willem
6 2014-01-06 06:28:00 35 36.1 24.9 Vasco-Pai~ Vasco-P~ Willem
7 2014-01-06 06:28:00 35 36.1 24.9 Vasco-Pai~ Vasco-P~ Willem
8 2014-01-06 06:28:00 35 36.1 24.9 Vasco-Pai~ Vasco-P~ Willem
9 2014-01-06 06:28:00 35 36.1 24.9 Vasco-Pai~ Vasco-P~ Willem
10 2014-01-06 06:28:00 35 36.1 24.9 Vasco-Pai~ Vasco-P~ Willem
# ... with 613,067 more rows, and 2 more variables:
# CurrentEmploymentType <chr>, CurrentEmploymentTitle <chr>
The car_unite dataset which consists of car assigned to GASTech employees and also truck drivers which have no CarID.
DT::datatable(car_unite)
Comparing the above car_unite DT table with the below gps_cleaned table, you will observe that there are vehicle IDs of 101-107. Those are assume to be trucks.
DT::datatable(gps_cleaned)